 binary classification task


TART: A plug-and-play Transformer module for task-agnostic reasoning

Neural Information Processing Systems

Large language models (LLMs) exhibit in-context learning abilities which enable the same model to perform several tasks without any task-specific training. In contrast, traditional adaptation approaches, such as fine-tuning, modify the underlying models for each specific task. In-context learning, however, consistently underperforms task-specific tuning approaches even when presented with the same examples. While most existing approaches (e.g., prompt engineering) focus on the LLM's learned representations to patch this performance gap, our experiments actually reveal that LLM representations contain sufficient information to make good predictions. As such, we focus on the LLM's reasoning abilities and demonstrate that this performance gap exists due to their inability to perform simple probabilistic reasoning tasks. This raises an intriguing question: Are LLMs actually capable of learning how to reason in a task-agnostic manner? We answer this in the affirmative and, as a proof of concept, propose TART which generically improves an LLM's reasoning abilities using a synthetically trained reasoning module.


Probing the Decision Boundaries of In-context Learning in Large Language Models

Neural Information Processing Systems

Recent language models, such as GPT-3+ [Brown et al., 2020, Achiam et al., 2023], have demonstrated ... Recent attempts to understand in-context learning have focused on various aspects. On the practical side, research has investigated the impact of different factors on in-context learning.






Does the Model Say What the Data Says? A Simple Heuristic for Model Data Alignment

arXiv.org Artificial Intelligence

In this work, we propose a simple and computationally efficient framework for evaluating whether machine learning models align with the structure of the data they learn from; that is, whether the model says what the data says. Unlike existing interpretability methods that focus exclusively on explaining model behavior, our approach establishes a baseline derived directly from the data itself. Drawing inspiration from Rubin's Potential Outcomes Framework, we quantify how strongly each feature separates the two outcome groups in a binary classification task, moving beyond traditional descriptive statistics to estimate each feature's effect on the outcome. By comparing these data-derived feature rankings with model-based explanations, we provide practitioners with an interpretable and model-agnostic method for assessing model-data alignment.
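As a rough illustration of the idea in this abstract, the sketch below scores each feature by how strongly it separates the two outcome groups and then compares a data-derived ranking with a model-derived one. It is not the authors' estimator: the standardized mean difference, the Spearman-style agreement score, and the names `separation_scores`, `rank_features`, and `rank_agreement` are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def separation_scores(rows, labels):
    """Absolute standardized mean difference per feature (assumed stand-in
    for the paper's effect estimate) between label-1 and label-0 rows."""
    scores = []
    for j in range(len(rows[0])):
        pos = [r[j] for r, y in zip(rows, labels) if y == 1]
        neg = [r[j] for r, y in zip(rows, labels) if y == 0]
        diff = abs(mean(pos) - mean(neg))
        # Pooled spread of the two groups (population standard deviations).
        pooled = ((pstdev(pos) ** 2 + pstdev(neg) ** 2) / 2) ** 0.5
        if pooled > 0:
            scores.append(diff / pooled)
        else:
            # No spread inside either group: perfect separation or no signal.
            scores.append(float("inf") if diff > 0 else 0.0)
    return scores


def rank_features(scores):
    """Feature indices ordered from most to least separating."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


def rank_agreement(rank_a, rank_b):
    """Spearman rank correlation between two rankings of the same
    features (assumes at least two features and no ties)."""
    n = len(rank_a)
    pos_a = {f: i for i, f in enumerate(rank_a)}
    pos_b = {f: i for i, f in enumerate(rank_b)}
    d2 = sum((pos_a[f] - pos_b[f]) ** 2 for f in rank_a)
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


# Toy data: feature 0 separates the groups clearly, feature 1 barely does.
rows = [[1.0, 5.0], [2.0, 5.5], [9.0, 5.2], [10.0, 4.9]]
labels = [0, 0, 1, 1]
data_ranking = rank_features(separation_scores(rows, labels))
```

A model-based ranking (for example, one obtained from feature attributions) could be passed to `rank_agreement` together with `data_ranking`; values near 1 would suggest the model emphasizes the same features as the data.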


LLMs Know More Than Words: A Genre Study with Syntax, Metaphor & Phonetics

arXiv.org Artificial Intelligence

Large language models (LLMs) demonstrate remarkable potential across diverse language-related tasks, yet whether they capture deeper linguistic properties, such as syntactic structure, phonetic cues, and metrical patterns, from raw text remains unclear. To analyze whether LLMs can learn these features effectively and apply them to important natural-language tasks, we introduce a novel multilingual genre classification dataset derived from Project Gutenberg, a large-scale digital library offering free access to thousands of public domain literary works. The dataset comprises thousands of sentences per binary task (poetry vs. novel; drama vs. poetry; drama vs. novel) in six languages (English, French, German, Italian, Spanish, and Portuguese). We augment each with three explicit linguistic feature sets (syntactic tree structures, metaphor counts, and phonetic metrics) to evaluate their impact on classification performance. Experiments demonstrate that although LLM classifiers can learn latent linguistic structures either from raw text or from explicitly provided features, different features contribute unevenly across tasks, which underscores the importance of incorporating more complex linguistic signals during model training.


Analog Physical Systems Can Exhibit Double Descent

arXiv.org Artificial Intelligence

An important component of the success of large AI models is double descent, in which networks avoid overfitting as they grow relative to the amount of training data, instead improving their performance on unseen data. Here we demonstrate double descent in a decentralized analog network of self-adjusting resistive elements. This system trains itself and performs tasks without a digital processor, offering potential gains in energy efficiency and speed -- but must endure component non-idealities. We find that standard training fails to yield double descent, but a modified protocol that accommodates this inherent imperfection succeeds. Our findings show that analog physical systems, if appropriately trained, can exhibit behaviors underlying the success of digital AI. Further, they suggest that biological systems might similarly benefit from over-parameterization.


Benchmarking Quantum Kernels Across Diverse and Complex Data

arXiv.org Artificial Intelligence

Quantum kernel methods have shown promise and are increasingly used among quantum machine learning approaches to enhance the performance of kernel-based models, of which support vector machines (SVMs) are a common example [1]. They have been applied to various machine learning tasks, such as the classification of medical or high-energy physics data [2, 3]. An advanced enhancement to these kernel methods is the trainable quantum kernel, which employs a parameterized quantum circuit (PQC), often referred to as an ansatz. Here, a quantum circuit's gate operations are controlled by a set of externally optimized classical parameters [4, 5]. This enables the quantum kernel to be trained and adapted to the specific structure of a dataset [6]. However, despite theoretical promise, the practical deployment of quantum kernel methods is still in its very early stages. Many research studies focus on a single specific machine learning area with a few dataset samples, but the performance of a quantum kernel across diverse domains remains unverified, whereas such versatility is common in classical kernel methods such as the linear kernel or Radial Basis Function (RBF) kernel [7]. This makes it difficult to understand the characteristics of the methods' performance from a comprehensive perspective. Furthermore, existing studies are primarily conducted on low-dimensional synthetic or introductory datasets such as variants of MNIST or Iris, or on real-world data aggressively reduced from hundreds or more features to around ten [8-10], leaving a large gap in their application to real-world machine learning scenarios.
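For reference, the two classical baselines named above can be written in a few lines. This is a minimal sketch in plain Python, not code from the paper; the function names and the `gamma` default are assumptions. A kernel-based model such as an SVM only needs the resulting Gram matrix, so a quantum kernel would slot in as a different `kernel` function.

```python
import math


def linear_kernel(x, y):
    """Linear kernel: the inner product of the two feature vectors."""
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, gamma=1.0):
    """RBF kernel: exp(-gamma * squared Euclidean distance).
    The default gamma is an arbitrary illustrative choice."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def gram_matrix(points, kernel):
    """Pairwise kernel values for a list of feature vectors."""
    return [[kernel(p, q) for q in points] for p in points]


# Toy usage with three 2-dimensional points.
points = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
K_lin = gram_matrix(points, linear_kernel)
K_rbf = gram_matrix(points, lambda p, q: rbf_kernel(p, q, gamma=0.5))
```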